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Abstract. In this paper we present a modified dqds algorithm (mdqds) for computing all of the 
singular values of a bidiagonal matrix to high relative accuracy. There are two key components in our 
modified algorithm: a novel bisection-type strategy that ensures worse case linear complexity, and 
an improved shift strategy that accelerates convergence for most bidiagonal matrices. Our extensive 
numerical experiments indicate that mdqds is typically 1.2x-4x faster than xLASQ (the LAPACK 
implementation of dqds) without any degradation in accuracy. On matrices for which xLASQ shows 
very slow convergence, mdqds can be 3x-llx faster. 
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1. Introduction. The dqds algorithm was introduced in [5] in 1994 as a way to 
compute singular values of bidiagonal matrices to high relative accuracy. A detailed 
account of how it can be implemented to become a public domain tool rather than a 
research procedure was published in 2000 in [T3] , which essentially documents the code 
used in LAPACK. Devise for fast execution on all the varied matrices available in 2000 
ensured that the shift strategy was more complicated than customary in the domain 
of eigenvalue algorithms. In |12j . a new version employed the idea of aggressive early 
deflation to enhance performance on large matrices. 

Despite all this effort there are blemishes in the LAPACK version, the most serious 
being significant slow down on certain matrices which have diagonal and off-diagonal 
entries that are of varying magnitude but are disordered (far from monotone decreas- 
ing). This paper supplies remedies for the known defects of the current LAPACK 
version. 

Specifically our contributions are 

1. Ability to deflate converged singular values that are far from the bottom of 
the matrix; 

2. An improved bound on the current smallest singular values; 

3. Modification to certain shift strategies; 

4. Guarantee that convergence is never slower than for bisection. 

Overall, mdqds is 1.2x-4x faster than dqds in general and up to llx faster for 
matrices for which dqds converges slowly. It should be mentioned that the dqds 
algorithm has found other uses than computation of singular values and the ideas in 
this paper may be of value in those situations as well [5] . 

From a theoretical point of view, mdqds is never slower than bisection in conver- 
gence, which establishes its linear complexity in the worst case. This is very important, 
since it ensures the robustness of implementation of mdqds algorithm. There is a sim- 
ilar result for the QR algorithm with Wilkinson shift for the symmetric tridiagonal 
eigenproblem [15] , 
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2. Essential information on dqds. In this section we gather essential prop- 
erties of dqds. The reader is expected to have some acquaintance with the algorithm. 
For an introduction to the topic we recommend |13j and for historical details [B]. By 
convention our bidiagonal matrices are all upper bidiagonal. Our notation is known 
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D 



a i 
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(2.1) 



In addition to dqds we refer to a simpler procedure oqd (orthogonal qd) that trans- 
forms B into B so that 



BB T = B T B, 



see [BJ. 



ALGORITHM 1: oqd algorithm 

5i = oi; 

for k=l, n-1,1 do 

ak \= y/a% + b\; 

bk := b k - (dfc+i/dfc); 

2fc+i := 5fe- (afe+i/fifc); 
end 

&n ■ — ^n; 



The bare dqd algorithm is obtained by squaring the variables in oqd, 

q k =a 2 k ,e k = b 2 k ,k = 1,2,- ■■ ,n,b n = 0. (2.2) 

However shifts may be used in dqd to obtain the dqds algorithm (differential quotient 
difference with shifts). The shift is s. By definition, dqd=dqds(s = 0). 



ALGORITHM 2: dqds algorithm 

di = qi — s; 

for k=l, n-1,1 do 

qk ■= d k + e k ; 

e k := eu- (q k +i/qk); 

dk+i = dk- (<?fe+i/<?fc) — s; 
end 

q?i - — dn , 



The point of introducing oqd is to exhibit the relation of the q, e variables to the 
a, b variables of B. However an alternative connection is to define 
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and then a little algebra shows that dqds implements the transform 



LU = UL- si. 



An essential ingredient in the success of dqds is that it never explicitly forms any 
products UL or LU but works entirely with the factors. In fact, LU and B T B are 
connected by a diagonal similarity transformation. 

The careful observer will see that the dqds transform is a non-restoring similarity 
transformation on UL: 



All eigenvalues are reduced by s. The dqds algorithm is the repeated application of 
the dqds transform with a sequence of well chosen shifts. The algorithm checks, before 
each transform, whether e„_i or e n is negligible. If so then one or two eigenvalues 
have been computed and the computation resumes on a smaller array. This is called 
deflation. 

For experts in the QR algorithm the non-restoring feature (of dqds) is trouble- 
some. For dqds convergence always means making UL singular. Thus it is essential 
to update the sum S of all shifts used so far. 

In the 1960's Kahan proved that a bidiagonal matrix define all its singular values 
to high relative accuracy, see pQ. In the context of dqds this precious feature will be 
preserved if all the q, e variables remain positive. Any zero values must be taken care 
of specially. Thus the shift s must obey 



This constraint means that the singular values must be found in monotone increasing 
order. This is in contrast to QR with shifts but QR does not claim to compute 
eigenvalues to high relative accuracy except when the shift is zero. 

The (q, e) variables require so little storage that it is sensible to compute (q, e) 
in separate location from (q, e), and then decide whether to accept or reject the 
transform. If any new variables are negative the algorithm rejects the transform and 
choose a smaller shift. This is called a failure. The ability to reject a transform 
permits an aggressive shift strategy 

A valuable feature of dqd is that at each minor step k, d~T =[{UL)~ l ]kk- There- 
fore d m in = minfc dk > \ m in{UL). Asymptotically d m - ln is a very good estimate of 
^■min(UL) but in the early stages it may be too big. There is more on this topic later. 

A well known feature of QR and LR algorithms is that the larger entries migrate 
gradually to the top of the matrix and the smaller ones precipitate to the bottom. 
The smaller they are the quicker they fall. 

When e k is negligible and k < n — 2 then matrix splits and it is vital to detect 
splits as will be discussed later. The criteria for negligibility have received careful 
attention and tests can be complicated. 

2.1. Some theoretical results. Sometimes we use BB T notation and some- 
times we will use UL notation to describe dqds algorithm. 

Theorem 2.1 (Theorem 2 in [6]). Apply the dqd transform to a positive bidiag- 
onal matrix B and d\, ■ ■ ■ ,d n are the intermediate values of d. Then 



UL = L-\LU)L = L- X (UL - sI)L. 



S < X m in(UL). 



2. [{BB T )-\. k 

3- ELO" 1 



1. a 2 n < mm k {d k } 
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The LAPACK code records d m i n as a guide to choosing a shift. 
Corollary 2.2. With the notation in Theorem \2.1\ we have 

°n < d min < na 2 n 
1 a (2-3) 

dmin —^n — dmin 

n 

Corollary |2.2|implies that d m i n becomes negligible when the matrix becomes nearly 



singular. It is the basis of our deflation Strategy One in section 4.1.1 There are 
similar results for dqds algorithm. 

Theorem 2.3 (Theorem 3 in [6]). If the dqds with shift s > transforms positive 
bidiagonal matrix B into positive B with intermediate quantities dx, ■ ■ ■ ,d n . Then 

1. o-l(B) < mmfc{4} 

2. [{BB T )-\ M < d^ 1 

Corollary 2.4. 

-d mm «J 2 n {B) < d min < n<j 2 n {B). (2.4) 
n 



These two corollaries have very important practical implications. 

• A good shift should not exceed d min and should not go below ^s^a- , and thus 
should always be in [ d ™ n , d m in\- 

• A very small cr^(-B) automatically implies a very small d m i n . If dk = d m - ln , 
k <C n, and is also negligible it is possible to deflate the matrix, see sec- 
tion gTT] 

3. Features of the current Lapack implementation. The algorithm pre- 
sented in [H] was designed for speed on all the test matrices available before 2000. It 
was averagely 5x faster than the Demmel-Kahan QR code executed without accumu- 
lation of the plane rotations [14]. The drive for efficiency, as distinct from elegance, 
produced a more complicated shift strategy than used in previous LR/QZ tridiaog- 
nal codes. Here is one example, by unrolling the last 3 minor steps, the code keeps 
values d n -2, d n —x and d n available so that, in the asymptotic regime, there is enough 
information to make a good approximation to the new a m i n even after one or two 
eigenvalues arc deflated. For example, d n -% might be d m i n for the deflated array. 

The code keeps the information for L, [/, L and U. If any entry in L or U is 
negative the transform is rejected and the step is declared a failure and a new (smaller) 
shift is constructed. Consequently an aggressive shift strategy may be employed. The 
number of failure is tried to be kept below 2 or 3% to reap the benefits of the aggressive 
shift strategy. 

To enhance data locality the variables are held in one linear array 
Z = {qi,ei,qi,ei,q 2 ,e 2 , q 2 , e 2 , ■ • • , q n }. 

3.1. The Blemish in xLASQ. The implementation of dqds in LAPACK, as of 
2000, is based on the following perception. 

"An explicit conditional statement (if-then-else) in an inner loop im- 
pedes efficient implementation on a pipelined arithmetic unit." 
Consequently the division in the dqds loop, t — qk+i/cjk, is not protected from 
incurring an exception (divide by zero or overflow). However, the powerful feature of 
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arithmetic unit conforming to IEEE floating point standard 754 is that computation 
is not held up by an exception. At the end of the loop the code tests whether an oo 
or a NaN (not a number) occurred and acts appropriately. 

The LAPACK implementation also assumes that a good compiler will implement 
intrinsic FORTRAN functions such as MIN(A,B) or MAX(A,B) efficiently. Hence the 
valuable variable d min ('■= min fc dk) is computed via dmin — MIN(g?, c?mm) in the inner 
loop. No explicit conditional. 

Consequently, it was decided to give up knowledge of the index at which d m i n 
receives its final value. This would require an explicit conditional, 
if d < dmin then 
index_dmin=k 

dfnin — d 

end if 

This decision was a strategic error. 

Recently, it has been observed that this explicit conditional statement need not 
impede performance provided that it is placed, not in its natural position, but imme- 
diately after the division. Technically this requires us to update d m i n one minor step 
late but that is easily dealt with. It is because division is so slow relative to other op- 
erations, such as comparison, that the conditional statement can be completed before 
the preceding division division finishes. Here is the new inner loop (in pseudo-code) 
for k — 1 to n — 1 

a_k = d k + e k ;t = q k+ i/q k ; 
if d k < then 

exit (early failure) 
else 

if d k < d min then 

index_dmin=A:; d m i n = d; 
end if 
end if 

efc = e k -t; d k+ i = d k - t — s; 
end for 
if d n < then 

late failure 
else if d n < d m i n then 

index_dmin=n; d m i n = d n ; 

end if 
end if 

Comments: Late failure is a place holder here. Our algorithm does not exploit 
the specific properties of late failure. 

4. Improvements. In this section, we summarize the improvements of our im- 
plementation over xLASQ in LAPACK-3.4.0, which can be divided into two types: 
deflation strategies and shift strategies. 

4.1. New deflation strategies. 

4.1.1. Setting negligible d k to zero. Recall that the dqds algorithm is non- 
restoring. The algorithm tries at every step to make the current matrix singular. 
Recall that the positive qe array defines both a bidiagonal matrix B and the matrices 
L, U, see [13] . It can happen that a leading principal submatrix of B becomes almost 
singular long before any negligible entries appear at the bottom of the matrix. This 
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situation is not easily detected by a simple inspection of the entries {a k ,b k }, or say 
Uk,e k }. 

To explain, it is best to go back to the oqd transform, for B T to B, and consider 
the process after [k — 1) minor steps shown in the following equation, 



B« = Q k B T 



a 2 b 2 
• 

a fc _i 




bk-i 
a k 
b k 





afc+i 
bk+i flfe+2 



b n -i a rt 



(4.1) 



The striking feature is that row A; is a singleton, its entry is d k and a\ — d k . Rows 
(1 : k) are upper bidiagonal, rows (k + 1 : n) are lower bidiagonal. Such matrices are 
often described as twisted. How small must d k be to declare it negligible ? Define 



new matrices B and E, using equation (4.1 1, by 



B^ = Q k B T = B + E, 

where row k of B is null and E = diag(0, • • • , 0, a k , 0, • • • ,0). Observe that EB = 
E T B = . Hence, 

B^ T B^ = (B + Ef(B^ +E) 

= B T B + B T E + E T B + E T E (4.2) 
= B T B + E T E. 



By Weyl's monotonicity theorem, for i = 1 : n, 

k 2 -a, 2 |<||£ 2 ||=4. (4.3) 

Here {of} are the ordered singular values of B. 

The non-restoring character of the dqds algorithm entails that the desired eigen- 
values are the {of + S} where 5* is accumulated sum of the shifts so far. Consequently, 
d k (and a k ) may be set to zero, when 

d k <eS <e(S + af). (4.4) 

In order to exploit such an event more work must be done. Examination of the 
inner loop of oqd shows that with d k = , the algorithm simply moves the remaining 
variables into new positions: dj = bj, bj — a^+i, j = k : n — 1, d n = d k = 0. Of course, 
it is more efficient to avoid the arithmetic in oqd and simply move the variables. 

The bidiagonal B reveals its singularity but deflation requires that b n -\ also 
vanish. More work needs to be done since b n -\ = a n may not be negligible. There 
are two options. 

Option A. Apply the oqd transform to B to obtain B and note that 6„_i = 
i) n _i(a n /a n _i) = 0, a n = a(a„/a„_i) = 0. The new singular value is S + d n = S and 
n <— n — 1. In practice it would be done by dqd transform. 
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Option B. The procedure invoked in the aggressive early deflation algorithm, 
see |12j . is useful here. In our implementation we use this option since we found it 
usually saves some flops than Option A. We describe it briefly. 

Apply a carefully chosen sequence of plane rotation on the right of B to chase 
the entry b n -i up the last column of B. The sequence of 'planes' is (n — l,n) , 
(n — 2, n), (n — 3, n), ■ ■ ■ , (1, n). The single nonzero entry in column n is called the 
bulge. Its initial value is 6 n _i and it shrinks as it rises up the column. The expectation 
is that the bulge will become negligible quickly. It turns out that the criterion for 
neglecting the bulge j3 is the same as the criterion above for neglecting dk, namely 

1 < eS, 

and 1 is expressible in terms of {qi, e.;}. The whole algorithm is described as follow- 
ings [H]. 



ALGORITHM 3: Option B 

x = e(n — 1); 

for k = n — 1, 2, —1 do 

ti = qk ; 

Qk = Qk + x; 

ti = 1/g/c; 

x = X' e k -i-t 2 ; 

Ck-i = efc_i- ti-t2', 

if x is negligible then 
break; 

end 
end 

if h=l then 

qi = qi + x 
end 



Remark: When the bulge is in position (k,n),k < n — 1, then the matrix may 

o o 

be written as B + E where E — fiek^n an d B's last row and column are null. Then 

{B + E)(B + E) T = BB + /3 2 e k el. In Algorithm^ x = /3 2 . The strategy, setting 
negligible d m i n to zero, is a deflation strategy and would also be called Strategy One. 

4.1.2. Improved Criterion for deflation. An adequate, but crude, criterion 
for setting (in B) to zero, is 

< cea min (B). 

In the context of a qd array this same criterion is 

e n _! < {cefS < (ce) 2 a min (B) 2 . (4.5) 

This is what is used in xLASQ. A more refined criterion arises from considering the 
trailing 2x2 submatrix of BB T , 
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By Weyl's theorem no eigenvalue of BB T changes by more than b n -\a n if the 

2x2 matrix ( , ^ ^»-i a n | - g subtracted from trailing 2x2 submatrix shown 
\b n -ia n J 

above. However the (n — l,n — 1) entry of BB T still involves b\_ x . No eigenvalue 
of i?B T can change by more than b 2 l _ 1 if it is neglected. We can set it to zero when 
it is negligible compared to either S + X m i n (BB T ) or a 2 n _ 1 . Consequently the crude 



criterion (4.5 1 may be replaced by the following pair of tests: 

b 2 l _ 1 < cemax(S, a^_ x ) and b n -\a n < ce(S + A min (i3B r ). 

In the context of a qd algorithm, we use 

e„_i < cemax(S', q n -i) and e n -xq n < (ceS) 2 . (4.6) 

The same argument gives us a more refined test for splitting B whenever e/. and 
e k qk+i are negligible. 

4.2. Some new shift strategies. 

4.2.1. Updating the upper bound. In his original papers on the qd algo- 
rithm, Rutishauser proposed updating upper and lower bounds, sup and inf, on a%^ n 
at all times. The original paper [B] on dqds followed the recommendation but the 
implementation |14j omitted to update sup when a transform failed. 

A transform L, U into L, U fails if any entry in L or U is nonpositive (but u n = 
is permitted). In this case the connection to a bidiagonal B is lost. It can only 
happen if the shift s exceeds c m i n (-B) 2 and thus sup can be set to s. This is valuable 
information. 

Here is the pseudo-code, with s being the shift. 
If shift s succeeds, 

sup=min{d mi „, sup-s}, 
else if shift s fails, 

sup=min{s, sup}, 

end if 

In case of failure the transform L and U is discarded. In choosing shift s we 
replace xLASQ's d mm by our sup. 

4.2.2. Twisted shift in last p rows. When d m i n is near the bottom, we use 
twisted factorization to choose a shift. We call this shift strategy restricted twisted 
shift strategy since it is only used when d m i n is in the last p rows. Our experiments 
suggest p = 20 is a good choice. A similar strategy is used in xLASQ (Cases 4 and 
5), where it is invoked when d m i n is in the last two rows. 

The idea behind this shift strategy is to compute an approximate eigenvector x of 
BB T by focusing on the submatrix around d m i n . Using x and Rayleigh quotient resid- 
ual, we can compute a lower bound j3 on the smallest eigenvalue with high probability 
that /? < <j min (B), see Section 6.3.3 of [TJ] for details. 

With techniques similar to those in [2] , we can show that the order of convergence 
for the twisted shift strategy is 1 + ^ ~ 1.7, faster than the Johnson shift which leads 
to order of convergence 3/2 , see [2] for details. 

Theorem 4.1. Assume the bidiagonal matrix B has positive nonzero entries. 
For the dqds algorithm with twisted shift strategy, the sequence {ejJl 1 }jf.Q converges 
to with order of convergence 1 + ^ . 



We leave the proof of Theorem 4.1 to the Appendix 
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4.2.3. Using suitable 2-by-2 (Kahan). Prof. Kahan [9] suggests a better 
method to compute the shift upper bound that requires little overhead [S]. The idea 
is that the smallest singular value of the submatrix around d m i n should be a better 
upper bound than d m i n . 

Let us consider the odq algorithm first. Let k be the index where dk = minj(aj), 



and assume k > 1, see equation (4.1). By the interlacing property of singular val- 
ues [5] , we know the smallest singular value of the leading principle k x k submatrix 

of B^ is larger than <x m ; n (.B). Similarly the smallest singular value of " fc -i ^fe-i 

|_ o-k 

is an upper bound of a m in{B). This claim is also valid for dqds algorithm. 

5. Finite step convergence property. In this section we justify our claim of 
worst case linear complexity of our modified algorithm. This property has allowed us 
to implement our algorithm without fear of early termination or non-convergence. 

Theorem 5.1. For an n X n bidiagonal matrix B with positive elements, our 
algorithm can compute each singular value in at most 

_ r log(n/e) 
'logW 

sweeps, and in total it requires at most about 0(nk) sweeps to find all the singular 
values. 



Proof. We initialize sup = d m i n after one dqd or dqds sweep. By Corollary 2.2 
we know sup < na^^B) 



2.4 



Denote the upper bound after kth iteration by b k , b — sup. Since we use b k to 
choose a shift and every sweep makes it smaller at least by (1 — a)b k (a=3/4 for 
example), after one dqds sweep 

b k+1 < ab k < a k na 2 mm (B). 

When it has converged, b k+1 should satisfy b k+1 < eer^ in (-B). Let ct k naf nin < 
ea min(B)> we get k > ^^xjaj ■ ^ ^ nas conver g e d but the matrix is not deflated, 
by Corollary |2.2| or [2~4} we can deflate it by setting negligible dk to zero. □ 



In our implementation we choose a = § and e = 1CT 16 . If n = 1000, it requires 
at most about 153 iterations to compute the smallest singular value. 

We emphasize that Theorem |5.1| only provides the worst case analysis. In the 
absolute majority of cases, our algorithm converges much faster, in the range of about 
10 iterations per singular value. However, there are some bidiagonal matrices for which 
it indeed requires 0(k) sweeps to find the smallest singular value. The situation is 
very similar to the zero-in algorithm and its recent variant for finding zeros of a general 
univariate non-linear equation (see [HE]). 

6. Numerical Results. We have implemented our algorithm in Fortran by in- 
corporating our new deflation and new shift strategies into the current xLASQ rou- 
tines. All experiments were performed on a laptop with 1.37G memory and 1.6GHZ 
AMD Turion 64 X2 Mobile TL-50 processor. For compilation we used the gf ortran 
compiler and the optimization flag -03, and linked the codes to BLAS and LAPACK. 

6.1. Comparing each new technique. Recently, we have been able to con- 
struct a number of bidiagonal matrices of different dimensions for which xLASQ con- 
verges so slowly that it requires more iterations than allowed in the code. In these 
matrices, the diagonals are highly disordered and both the diagonal and off-diagonal 
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Table 6.1 
The bidiagonal matrices 





n 


Description of the bidiagonal matrix B 


Sourse 


1 


544 


Matrix_l 


m 


2 


1000 


Matrix_2 


nn 


3 


1087 


Matrix_3 


m 


4 


1088 


Matrix_4 


m 


5 


5000 


random bidiagonal matrix 





Table 6.2 
The speedup ( *%^ Q) ) 



Matrix 


n 


xLASQ 


Ml 


M2 


M3 


M4 


M5 


1 


544 


1.00 


1.93 


2.00 


2.94 


3.16 


3.83 


2 


1000 


1.00 


1.84 


1.91 


2.68 


2.82 


3.30 


3 


1087 


1.00 


5.22 


5.68 


9.11 


9.47 


10.49 


4 


1088 


1.00 


2.00 


2.12 


3.03 


3.43 


3.81 


5 


5000 


1.00 


1.05 


1.05 


1.08 


1.07 


1.16 



entries are of massively varying orders of magnitude. In this section, we use these 
tough matrices and a random matrix to show the improvements of our modifications. 
There are 5 main modifications over xLASQ. We use the following notation to denote 
them. 

• Ml: setting negligible dk to zero; 

• M2: new deflation strategies; 

• M3: updating the upper bound; 

• M4: twisted shift in last 20 rows; 

• M5: using suitable 2-by-2 submatrix (Kahan); 

Tables [6~2| and |6 . 3| show speedups over xLASQ in terms of floating point operations 
(flops) and time. The flops are measured by the number of divisions, which are the 
dominant costs. From the tables we can see that mdqds performs far fewer iterations 
and is about 3x-llx faster for these tough matrices. The average iterations for each 
singular value of these matrices are shown in the Table |6.4| 

This superior performance of mdqds is largely due to Ml, since it can deflate 
40% — 80% percent of the singular values (see Figure 6.1(b)| ) Ml is almost 'tailor- 
made' for such tough matrices. It is also very interesting to note that Ml usually can 
deflate about 10% of all singular values for a general bidiagonal matrix. 

For the Matrix_3, Figure 6.2(a) shows the locations of negligible d m i n when finding 
the first 500 singular values and about 70% of singular values of this matrix are found 
by Ml. For this matrix, our final algorithm has about llx speedup in time. We 
compare the number of iterations when finding the first 10 singular values of Matrix_3. 
The results are shown in Figure 6.2(b) From it we can see that our algorithm deflate 
one singular in about 13 iterations, while xLASQ takes about 130 iteration per singular 
value. 

The improvements M2-M5 aim to accelerate the convergence for general bidiago- 
nal matrices. From Table [6~2| and Table [673] , we can see that each of these techniques 
improves the performance. Since different matrices may have very different proper- 
ties, we can not expect these techniques to help a lot for all matrices. To show their 



dqds algorithm 



11 



Table 6.3 

The tveedvv f Tim < xLA SQ) ) 
1 ne speeaup ( Timl ,i M i) > 



Matrix 


n 


xLASQ 


Ml 


M2 


M3 


M4 


M5 


1 


544 


1.00 


1.93 


2.07 


3.41 


3.63 


3.87 


2 


1000 


1.00 


1.82 


1.88 


2.95 


2.95 


3.65 


3 


1087 


1.00 


4.94 


5.57 


9.17 


9.57 


10.99 


4 


1088 


1.00 


1.93 


2.04 


3.30 


4.19 


4.56 


5 


5000 


1.00 


1.04 


1.04 


1.13 


1.12 


1.19 



Table 6.4 

The average iterations for each singular value 



Matrix 


1 2 3 4 5 


xLASQ 
mdqds 


45.4 26.0 74.8 30.7 10.5 
11.65 7.31 7.57 8.71 7.89 



performance, we further use all the matrices in stetester to test each improvement 



by adding them one by one. Figure 6.1(a) shows the speedup of M5 over xLASQ 



There are totally 405 matrices in stetester. We use the number of divisions to 
measure the improvement, which is better than using the number of iterations and 
time. Each technique is helpful for most of these matrices; due to the limit of space, 
we do not include all these results. 

6.2. Some more tests. To further show the improvement of M5, we use some 
more matrices to test our algorithm. These matrices can be divided into two classes: 
1) matrices from application; 2) matrices constructed for testing dqds algorithm. 

These matrices are illustrated in Table |6.5| The last four are from industrial ap- 
plications which are collected in LAPACK tester stetester [TU] and can be obtained 
from Osni's homepage. The sizes of these matrices are not very big except the seventh 
and eighth matrix. We use these two big matrices to show our algorithm has nearly 
the same performance independent of the size of matrices. 

The results are shown in Figure 6.3(a) and Figure 6.3(b)| Figure 6.3(a) shows 



the speedup over xLASQ in terms of division. For these matrices from industrial 
application we have about 20% speedup. For these glued matrices which are difficult 
for dqds algorithm our algorithm has even more speedup, about 2x-5x times faster. 



Figure |6. 3(b) supplies us more information. For these difficult matrices, Ml deflates 



about 40% of the singular values. Ml also helps some of these industrial matrices. 



Figure |6.3(b) further supports our conclusion that Ml works very well for these tough 
matrices for dqds algorithm. 

For these 320 matrices in stetester for which xLASQ needs more than 0.01 x n 2 
divisions, M5 saves about 20% operations for all these matrices. Among these 320 
matrices, our algorithm is 4x faster for 10 of them; 2x faster for 15 of them; 1.5x 
faster for 33 of them. The results are shown in Figure |6.4| 

6.3. The comparison of accuracy. To show the accuracy of singular values 
computed by our algorithm, we compared the singular values obtained by our algo- 
rithm with those gotten from bisection algorithm. We first change the singular value 
problem of a bidiagonal matrix into the eigenvalue problem of a tridiagonal matrix 
of double size with zero diagonals, and then use bisection algorithm to solve it. We 
assume the eigenvalues computed by bisection algorithm are 'correct'. The results of 
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Fig. 6.1. Results for these tough matrices 
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Fig. 6.2. More results for Matrixes in Table\6.1 



maximum relative error are shown in Figure 6.5(a) Figure 6.5(b) shows the 2-norm 
of relative errors. 



We also compare the accuracy with xLASQ. The results from Figure 6.6(a) and 
Figure 6.6(b)| show our algorithm is usually more accurate than xLASQ. 



6.4. Comparisons with Aggressive Early Deflation. The Aggressive Early 
Deflation strategy [12] is designed to enhance dqds algorithm. This deflation strat- 
egy can help a lot for matrices which are not difficult for dqds algorithm and large 



dimensions. In this subsection, we use the matrices in Table 6.1 to show that AggDef 
does not help a lot for difficult matrices. To compare the accuracy, we also compare 
them with the result computed by bisection. There are two versions of aggressive 
early deflation in [T2], denoted by AggDefl and AggDef2, see [T2] for details. 

From the results in Table |6.6| we can see that for these tough matrices AggDefl 
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Table 6.5 
More bidiagonal matrices 





n 


Description of the bidiagonal matrix B 


Sourse 


1 


5000 


s/Oi = n + 1 — i, y/ei = 1 


M 


2 


5000 


Cholcsky factor of tridiagonal (1,2,1) matrix 


ma 


3 


5000 


Cholesky factor of Wilkinson-type matrix 


ma 


4 


5000 


Cholesky factor of Clement-type matrix 


ma 


5 


2807 


Cholesky factor of Glued Wilkinson matrix 


m 


6 


2807 


Cholesky factor of Glued Clement matrix 


m 


7 


15005 


Cholcsky factor of Glued Wilkinson matrix 


ma 


8 


15005 


Cholesky factor of Glued Clement matrix 


ma 


9 


96 


Cholesky factor of T_bccstkm01_2.dat 


EI 


10 


300 


Cholcsky factor of Fann04.dat 


EH 


11 


2910 


Cholesky factor of T_nasa2910_l.dat 


m 


12 


6245 


Cholcsky factor of T_Alemdar.dat 
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Fig. 6.3. More results for matrices from construction and industry 
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Fig. 6.6. The comparison of accuracy with xLASQ 



10000 



and AggDef2 need nearly the same time and have the same accuracy as xLASQ, mdqds 
algorithm is more accurate and fastest. For this random matrix with dimension 5000, 
AggDefl and AggDef2 nearly does not help either. 

Our algorithm and AggDef are kind of orthogonal. AggDef may help for matrices 
which are easy for dqds algorithm, especially when the size of matrix is big; while our 
main contribution of this paper is improving dqds algorithm for difficult matrices. 

7. Conclusions. In this paper we propose some novel deflation strategies which 
make dqds algorithm more robust. We call our algorithm mdqds. Some improved shift 
strategies are also included. Especially, the deflation Ml is 'tailor-made' for disordered 
matrices for dqds algorithm, which is completely different from the classic deflation 
strategies. These improvements together make our algorithm up to llx faster for 
these difficult matrices and 1.2x-4x faster in general. Our algorithm is usually more 
accurate. 
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Table 6.6 
Comparison with AggDef / 121 





Methods 


Matrix.l Matrix_2 Matrix_3 Matrix_4 Matrix_5 


n 




544 1000 1087 1088 5000 


Time 

Time(xLASQ) 
Time(rnethod) 


AggDefl 
AggDef2 
Ours 


1.000 1.000 0.999 1.003 0.998 
1.000 1.000 0.997 1.003 1.000 
4.297 3.543 11.101 4.260 1.211 


Max. Rel. 
Error 


AggDefl 
AggDef2 
xLASQ 
Ours 


6.22e-15 9.55e-15 4.07e-14 2.42e-14 8.65e-15 
6.66e-15 9.33e-15 3.97e-14 2.35e-14 9.52e-15 
6.22e-15 9.54e-15 4.07e-14 2.42e-14 9.47e-15 
3.66e-15 7.99e-15 3.85e-15 5.66e-15 6.27e-15 
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Appendix In this Appendix we prove Theorem |4.1| Our techniques are similar to 
those used in [3j to prove the order of convergence of the Johnson shift. 

We first show that for sufficiently large I the shifts are chosen by twisted shift 
strategy, then prove its order of convergence. We first introduce a theorem from [2], 
which states the convergence of dqds algorithm. 

Theorem .1 (Convergence of the dqds algorithm [2]). Suppose the matrix B 
defined as in (2.1) has positive nonzero elements, and the shift in the dqds algorithm 
is taken so that < s^ < (c^n) 2 holds. Then 

oo 



i=o 



Mor 



lime^=0 (k= 1,2,--- ,n-l), (.2) 

I— too 

lim-^ r = p fc , (fc = l,---,n-l), (.3) 

I— too e>^ ) 

OO 

=al-Y,s {l) (fc = 1, 2, • • • , n), (.4) 

1=0 

°k Z^i—o * 

Based on Theorem [7TI we can prove the following lemma. 

Lemma .2. Under Assumption (A), for all sufficiently large I, the shift will be 
chosen by twisted shift strategy, 



s 



(') 



1 



In , , ,|, , (-5) 



where j n = d min = q n . 

Proof. From Theorem .1 we know converges to zero and qjp > 0, k — 
1, 2, • • • , n — 1. And {qf 1 } gradually become monotonic. 

For sufficiently large I we will have d m i n k = n. And from twisted factorization 
(see page 239-243, (Hi) we get 



- II ? 



z(n) = 1, 

We assume z = (a: 1) T . From ejjP -> 0, we know ||a;|| 2 can be very small for a 
sufficiently large N and shift will be chosen by pj|). 
By equation (13]), 



lim-^-=p fe <l, (fc =!,••■ ,n-l), 



for all I > N' (N' > N), \\x\\ decreases and the shifts would be chosen by twisted shift 
strategy. □ 
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Remarks. If ||a;|| < jg, xLASQ will use as a shift. To prove the main 
conclusion of this Appendix, we assume d m i n always moves to the bottom. The aim 
is analyzing the asymptotic convergence rate of twisted shift strategy. The following 
lemma reveals another good property of twisted shift. It is said that d m i n — qlP 
always converges to zero if using twisted shift. 

Lemma .3. Under the same assumption as in Lemma we have 

oo 
1=0 



lim =4-a 2 n (k = 1, • • • , n - 1); lim g« = 0. 



Proof. Since — > 0, from Lemma [^2] we know 

lim *W = gW > 0. 



Furthermore, since lim^oo = from equation in Theorem Q] we have 
lim/^oo gi'* 1 = 0. □ 

We are now ready to pro ve the main conclusion. 



Proof of Theorem 

Algorithm [2j we have 



4.1 



First, we compute the order of convergence of e^Li- By 



(0 (0 

„(J+i) _ A l+r > g " _ ,(0 - fl (0 _ P (0 g " _ ,(0 

•in — u n-l — "« n— 1 

?n-l 9 n _i 

— Hn e n-l * ' 

By Lemma f2| the shift is chosen by our twisted shift strategy for sufficiently large I, 
and we have 

„(!+!) = „(l) _ _ fl (0 1 ~ H x ll 

Hn Hn "- 1 y ™ 1 + llxll 2 



i + IWI 



where 



.(') 



o <0 

-n-2 



A = 



-■VTT< 



Hence 



\ a (l) \ a (l) \ a [l) 

\ 1n-l \ 1n-2 \ 1 n -l 



in ~ In V c n-l r— 77 — c n-l ' 

(i + NI 2 )V?i-i 



(.6) 



On the other hand, from the fourth line of Algorithm [2] we also have 
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which implies 

(1+2) (1+2) (Z+l) + 

U + l) _ e n-l gn-1 , (2) _ e n-l 1n-l 

In - duu In ~ (j) 

e ri-l e n-l 

Plugging these equations into ([^6|, we obtain 

(2+2) (2+2) (i+i) (z+i) . s /=— ■ — 

/ (i) U + IFIIJ V-L + a (i+i) 



JO 



*2<n i i 6„ 



which can be rewritten as 

fe (i+1) V 

,g+a) _ -,m v "~ x ; 

where 

(;) i / g i i + 1) (i + |M|) % /r+^ 



Since y/l + a^gz -+ 1, e^ 3 -> 0, and converges to cr^ - ^ > 0, 

it follows that converges to 1 as well. This implies that the sequence {loge^i} 
asymptotically satisfies a second order linear recursion: 

Ioge£t? = 21oge^ 1 1 ) - \ loge^ + log7 W 
We hence conclude that 

loge^ = 0{J3 l ), 
where (3 > 1 satisfies the equation 

In other words, /3 = 1 + ^ » 1.7. Hence the order of convergence for the sequence 
{e^jis/3. " □ 



